Abstract —In this paper, we propose a novel fitting method 
that uses local image features to fit a 3D Morphable Model 
to 2D images. To overcome the obstacle of optimising a cost 
function that contains a non-differentiable feature extraction 
operator, we use a learning-based cascaded regression method 
that learns the gradient direction from data. The method allows
shape and pose parameters to be solved simultaneously. Our method
is thoroughly evaluated on Morphable Model generated data and 
first results on real data are presented. Compared to traditional 
fitting methods, which use simple raw features like pixel colour 
or edge maps, local features have been shown to be much more 
robust against variations in imaging conditions. Our approach 
is unique in that we are the first to use local features to fit a 
Morphable Model. 

Because of its speed, our method is suitable for real-time
applications. Our cascaded regression framework is available
as an open source library.*

Keywords — 3D Morphable Model, cascaded regression, local 
features, SIFT, 3D reconstruction, supervised descent. 

I. Introduction 

This work tackles the problem of obtaining a 3D representation
of a face from a single 2D image, which is an
inherently ill-posed problem with many free parameters: the
orientation of the face, identity and lighting, amongst others.
A 3D Morphable Face Model (3DMM) usually consists
of a PCA model of shape and one of colour (albedo), a camera
model and a lighting model. To fit the model to a 2D image, or in
other words, to reconstruct all these model parameters, a cost
function is set up and optimised.


Existing fitting algorithms that define and solve
these cost functions can loosely be classified into two categories:
those with linear cost functions and those with non-linear
cost functions. Algorithms that fall into the first category
typically use only facial landmarks like eye or mouth corners
to fit shape and camera parameters, and use image information
(pixel values) to fit an albedo and light model. Often,
these steps are separate and applied iteratively. The second
class of algorithms traditionally uses a more complex
cost function, employing the image information to perform shape
from shading or edge detection, and applying a non-linear solver
to jointly or iteratively solve for all the parameters.
Recently, Schönborn et al. proposed a Markov Chain
Monte Carlo based fitting method that integrates automatic
landmark detections. Most of these algorithms require
several minutes to fit a single image.

A common point of these algorithms is that they use either
only landmark information or simple features like raw colour
values or edge maps, while in many other domains, most
notably 2D facial landmark detection, these have long been
superseded by local image features like Histogram of Oriented
Gradients (HoG) or Scale Invariant Feature Transform
(SIFT). However, it is non-trivial to use such features to
fit 3D Morphable Models: they are non-differentiable operators
and have not yet been used for 3DMM fitting.

Recently, cascaded regression based methods have been
widely used with promising results in pose estimation
and 2D face alignment. In general, a
discriminative cascaded regression based method performs optimisation
in local feature space by applying a learning-based
approach to circumvent the problem of non-differentiability.
The method allows the gradient of a function to be learned from
data, instead of being obtained by differentiation. In this paper, we propose to
use a similar strategy to perform 3DMM fitting using SIFT
features.

In comparison with existing 3DMM fitting algorithms,
using image features and a cascaded regression based approach
has the potential to give the best of both worlds: it is
fast, like the linear landmarks-only fitting methods, and at
the same time robust to changing image conditions, and it
can potentially be more precise, because image information is
used to fit shape and pose, and not only landmarks. Furthermore,
it is possible to solve for pose and shape parameters
simultaneously, instead of iteratively. This paper introduces
and evaluates the proposed novelty in the context of fitting
pose and shape parameters.

* https://github.com/patrikhuber 


In the rest of this paper, we first give a brief introduction
to the cascaded regression method and 3DMMs (Section II).
In Section III we present the concept of using local image
features to fit a 3D Morphable Model, and show how cascaded
regression is applied to optimise the Morphable Model parameters
in local feature space. We thoroughly evaluate our method
using pose and shape data generated from the Morphable
Model, as well as on PIE fitting results (Section IV). Section V
concludes the paper.


II. Background 

Given an input face image I and a pre-trained 3DMM, our 
goal is to find the pose and shape parameters of the 3DMM 
that best represent the face. In our setting, a face box or facial 
landmarks are given to perform a rough initialisation of the 
model. The goal is then to obtain an accurate fitting result 
using a cost function that incorporates the image information 
in the form of local features. To facilitate this, the cascaded 
regression framework is used to learn a series of regressors. 




In this section, we will briefly introduce the generic cascaded
regression framework and the 3D Morphable Model.
The 3D Morphable Model fitting using local image features
will then be introduced in Section III.

A. Cascaded Regression 

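Given an input image $\mathbf{I}$ and a pre-trained model $\Omega(\boldsymbol{\theta})$ with
respect to a parameter vector $\boldsymbol{\theta}$, the aim of a regression based
method is to iteratively update the parameters $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \delta\boldsymbol{\theta}$
to maximise the posterior probability $p(\boldsymbol{\theta} \mid \mathbf{I}, \Omega)$. A regression
based method solves this non-linear optimisation problem by
learning the gradient from a set of training samples in a
supervised manner. The goal is to find a regressor:

$$R: \mathbf{f}(\mathbf{I}, \boldsymbol{\theta}) \mapsto \delta\boldsymbol{\theta}, \tag{1}$$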

where $\mathbf{f}(\mathbf{I}, \boldsymbol{\theta})$ is a vector of extracted features from the input
image, given the current model parameters, and $\delta\boldsymbol{\theta}$ is the predicted
model parameter update. This mapping can be learned
from a training dataset using any regression method, e.g. linear
regression, random forests or artificial neural
networks. In contrast to these regression algorithms,
a cascaded regression method generates a strong regressor
consisting of $N$ weak regressors in cascade:

$$R = R_1 \circ \dots \circ R_N, \tag{2}$$

where $R_n$ is the $n$th weak regressor in cascade. In this paper,
we use a simple linear regressor:

$$R_n: \delta\boldsymbol{\theta} = \mathbf{A}_n \mathbf{f}(\mathbf{I}, \boldsymbol{\theta}) + \mathbf{b}_n, \tag{3}$$

where $\mathbf{A}_n$ is the projection matrix and $\mathbf{b}_n$ is the offset (bias)
of the $n$th weak regressor.

More specifically, given a set of training samples
$\{\mathbf{f}(\mathbf{I}_i, \boldsymbol{\theta}_i), \delta\boldsymbol{\theta}_i\}_{i=1}^{M}$, we first apply the ridge regression
algorithm to learn the first weak regressor by minimising the loss:

$$\sum_{i=1}^{M} \| \mathbf{A}_1 \mathbf{f}(\mathbf{I}_i, \boldsymbol{\theta}_i) + \mathbf{b}_1 - \delta\boldsymbol{\theta}_i \|_2^2 + \lambda \| \mathbf{A}_1 \|_F^2, \tag{4}$$

and then update the training samples, i.e. the model parameters
and the corresponding feature vectors, using the learned regressor
to generate a new training dataset for learning the second weak
regressor. This process is repeated until convergence
or until a pre-defined maximum number of regressors is exceeded.

In the test phase, these pre-trained weak regressors are 
progressively applied to an input image with an initial model 
parameter estimate to update the model and output the final 
fitting result. 
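
The training and test phases described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released library: the names `fit_ridge`, `train_cascade`, `apply_cascade` and `toy_features` are hypothetical, the bias is left unregularised by centring the data, and in the toy problem the "image" is simply the ground-truth parameter vector, so that a non-linear feature response can be simulated without real images.

```python
import numpy as np

def fit_ridge(F, D, lam):
    """Closed-form ridge regression for the loss in equation (4)."""
    f_mean, d_mean = F.mean(axis=0), D.mean(axis=0)
    Fc, Dc = F - f_mean, D - d_mean              # centring keeps the bias unregularised
    gram = Fc.T @ Fc + lam * np.eye(F.shape[1])
    A = np.linalg.solve(gram, Fc.T @ Dc).T       # shape: (n_params, n_features)
    return A, d_mean - A @ f_mean

def train_cascade(feature_fn, images, theta_true, theta_init, n_stages=3, lam=1e-3):
    """Learn the weak regressors R_1 ... R_N one after another."""
    theta, cascade = theta_init.copy(), []
    for _ in range(n_stages):
        F = np.stack([feature_fn(img, t) for img, t in zip(images, theta)])
        A, b = fit_ridge(F, theta_true - theta, lam)   # regression targets are delta theta
        theta = theta + F @ A.T + b                    # update the training samples
        cascade.append((A, b))
    return cascade

def apply_cascade(cascade, feature_fn, image, theta):
    """Test phase: apply the weak regressors progressively to an initial estimate."""
    for A, b in cascade:
        theta = theta + A @ feature_fn(image, theta) + b
    return theta

# Toy problem (assumption): the "image" is the true parameter vector itself.
rng = np.random.default_rng(0)
W = rng.normal(size=(20, 3))

def toy_features(image, theta):
    return np.tanh(W @ (image - theta))

truth = rng.uniform(-1.0, 1.0, size=(500, 3))
start = truth + rng.uniform(-0.5, 0.5, size=truth.shape)
cascade = train_cascade(toy_features, truth, truth, start)

target = np.array([0.2, -0.1, 0.4])
initial = target + np.array([0.3, -0.3, 0.2])
estimate = apply_cascade(cascade, toy_features, target, initial)
```

Each stage is trained on the parameters produced by the previous stage, which is what distinguishes the cascade from a single regressor fitted once.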

B. The 3D Morphable Model 

A 3D Morphable Model consists of a shape and an albedo
(colour) PCA model, of which we use the shape model in
this paper. It is constructed from 3D face meshes that are
in dense correspondence, that is, vertices with the same
index in the mesh correspond to the same semantic point
on each face. In a 3DMM, a 3D shape is expressed as
$\mathbf{v} = [x_1, y_1, z_1, \dots, x_V, y_V, z_V]^T$, where $[x_v, y_v, z_v]^T$ are the
coordinates of the $v$th vertex and $V$ is the number of mesh vertices.
PCA is then applied to the data matrix consisting of $m$ stacked
3D face meshes, which yields $m - 1$ eigenvectors $\mathbf{V}_i$, their
corresponding variances $\sigma_i^2$, and a mean shape $\bar{\mathbf{v}}$. A face can
then be approximated as a linear combination of the basis
vectors:

$$\mathbf{v} = \bar{\mathbf{v}} + \sum_{i=1}^{m-1} \alpha_i \mathbf{V}_i, \tag{5}$$

where $\boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_{m-1}]^T$ is the shape coefficient or
parameter vector.
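
As a small illustration of the shape model and of equation (5), the following sketch builds a PCA basis from a handful of toy meshes and generates a face instance from a coefficient vector. The function names and the random toy data are assumptions made for this example, and using an SVD of the centred data is one common way to compute the PCA, not necessarily the one used to build the model in this paper.

```python
import numpy as np

def build_shape_model(meshes):
    """meshes: (m, 3V) array, one flattened mesh [x1, y1, z1, ..., xV, yV, zV] per row."""
    m = meshes.shape[0]
    mean = meshes.mean(axis=0)
    # SVD of the centred data yields the PCA basis without forming the covariance matrix.
    _, s, vt = np.linalg.svd(meshes - mean, full_matrices=False)
    basis = vt[: m - 1].T                     # (3V, m-1), columns are the eigenvectors
    variances = s[: m - 1] ** 2 / (m - 1)     # corresponding variances
    return mean, basis, variances

def shape_instance(mean, basis, alpha):
    """Equation (5): mean shape plus a linear combination of the basis vectors."""
    return mean + basis @ alpha

# Toy data (assumption): m = 5 random "meshes" with V = 10 vertices each.
rng = np.random.default_rng(1)
meshes = rng.normal(size=(5, 30))
mean, basis, variances = build_shape_model(meshes)

# Projecting a training mesh onto the basis and back reproduces it.
alpha = basis.T @ (meshes[0] - mean)
reconstruction = shape_instance(mean, basis, alpha)
```

With all coefficients set to zero, `shape_instance` returns the mean face, which is the shape initialisation used later in the fitting.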

III. 3D Morphable Model Fitting using Local 
Image Features 

In this section, we will present how we formulate the 
cascaded regression approach to perform model-fitting using 
local features. 

A. Local Feature based Pose Fitting 

To estimate the pose of the 3D model, we select the
parameter vector $\boldsymbol{\theta}$ to be $\boldsymbol{\theta} = [r_x, r_y, r_z, t_x, t_y, t_z]^T$, with
$r_x$, $r_y$ and $r_z$ being the pitch, yaw and roll angles respectively,
and $t_x$, $t_y$ and $t_z$ the translations in 3D model space. We can
then project a point $\mathbf{p}^{3D} = [p_x, p_y, p_z, 1]^T$ in homogeneous
coordinates from 3D space to 2D using a standard perspective
projection:

$$\mathbf{p}^{2D} = \mathbf{P} \times \mathbf{T} \times \mathbf{R}_y \times \mathbf{R}_x \times \mathbf{R}_z \times \mathbf{p}^{3D}, \tag{6}$$

where $\mathbf{R}_{x,y,z}$, $\mathbf{T}$ and $\mathbf{P}$ are $4 \times 4$ rotation, translation and
projection matrices respectively, constructed from the values
in $\boldsymbol{\theta}$, followed by perspective division and conversion to screen
coordinates.
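
The projection of equation (6) can be sketched as follows. Several details are assumptions, since the text does not fix them: angles are in radians, the rotations are composed in the order yaw, pitch, roll, the camera looks down the negative z-axis, the projection matrix is a bare pinhole model with focal length `focal`, and screen coordinates have their origin in the top-left corner. All function names are hypothetical.

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1.0]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1.0]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1.0]])

def translation(tx, ty, tz):
    T = np.eye(4)
    T[:3, 3] = [tx, ty, tz]
    return T

def perspective(focal):
    # Assumed pinhole model: w = -z, so points in front of the camera have w > 0.
    return np.array([[focal, 0, 0, 0], [0, focal, 0, 0], [0, 0, 1, 0], [0, 0, -1, 0.0]])

def project(points_3d, theta, focal, width, height):
    """Project (n, 3) model points to (n, 2) screen coordinates for pose theta."""
    rx, ry, rz, tx, ty, tz = theta
    M = perspective(focal) @ translation(tx, ty, tz) @ rot_y(ry) @ rot_x(rx) @ rot_z(rz)
    homogeneous = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    clip = homogeneous @ M.T
    xy = clip[:, :2] / clip[:, 3:4]                      # perspective division
    return np.column_stack([xy[:, 0] + width / 2.0,      # conversion to screen coordinates
                            height / 2.0 - xy[:, 1]])

points = np.array([[0.0, 0.0, 0.0], [100.0, 0.0, 0.0]])
frontal = project(points, [0, 0, 0, 0, 0, -1200.0], focal=1500.0, width=640, height=480)
rolled = project(points, [0, 0, np.pi / 2, 0, 0, -1200.0], focal=1500.0, width=640, height=480)
```

With the model at z = -1200 and a focal length of 1500, a point 100 units to the right of the origin lands 125 pixels to the right of the image centre.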

From the full 3D model’s mesh, we choose a subset of
$n$ 3D vertices from the mean shape model $\bar{\mathbf{v}}$ in homogeneous
coordinates, i.e. $\mathbf{v}_i \in \mathbb{R}^4$, $i \in 0 \dots n-1$. Given the current pose
parameters $\boldsymbol{\theta}$, we then project them onto the 2D image to obtain
a set of 2D coordinates $\mathbf{v}_i^{2D}$. Next, local features are extracted
from the image around these projected 2D locations, resulting
in $n$ feature vectors $\{\mathbf{f}_1, \dots, \mathbf{f}_n\}$, where $\mathbf{f}_i \in \mathbb{R}^d$, with $d = 128$
in our case of SIFT features. These feature vectors are then
concatenated to form one final feature vector, which is the
output of $\mathbf{f}(\mathbf{I}, \boldsymbol{\theta})$ and the input for the regressor. Figure 1 shows
an overview of the process with an input image, the projected
3D model, the locations used to extract local features and their
respective location in the input image.
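
The feature extraction step can be illustrated with a simplified stand-in descriptor. The sketch below does not compute SIFT: `patch_descriptor` is a hypothetical, much cruder histogram of gradient orientations with 8 bins instead of 128 dimensions, used only to show how per-point descriptors are extracted at the projected locations and concatenated into one vector. A greyscale image stored as a 2D array is assumed.

```python
import numpy as np

def patch_descriptor(image, x, y, half=8, bins=8):
    """Stand-in for SIFT: L2-normalised histogram of gradient orientations in a patch."""
    h, w = image.shape
    # Clamp the centre so that the patch stays inside the image.
    cx = int(np.clip(np.rint(x), half, w - half - 1))
    cy = int(np.clip(np.rint(y), half, h - half - 1))
    patch = image[cy - half: cy + half + 1, cx - half: cx + half + 1].astype(float)
    gy, gx = np.gradient(patch)
    angle = np.arctan2(gy, gx) % (2 * np.pi)
    hist, _ = np.histogram(angle, bins=bins, range=(0, 2 * np.pi),
                           weights=np.hypot(gx, gy))
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

def extract_features(image, points_2d):
    """Concatenate the n per-point descriptors into the single feature vector f(I, theta)."""
    return np.concatenate([patch_descriptor(image, x, y) for x, y in points_2d])

rng = np.random.default_rng(2)
image = rng.uniform(0.0, 1.0, size=(64, 64))
points_2d = np.array([[20.0, 20.0], [44.0, 20.0], [32.0, 32.0], [24.0, 46.0], [40.0, 46.0]])
feature_vector = extract_features(image, points_2d)
```

The length of the result is the number of points times the descriptor dimension, so with SIFT it would be $128n$.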

B. Local Feature based Shape Fitting 

As the cascaded regression method allows arbitrary parameters
to be estimated, we can apply it to estimating the shape
parameters in local feature space as well. Our motivation is
that the image data contains information about a face’s shape, 
and we want to reconstruct the model’s shape parameters for 
the subject in a given image. Similar to the previous section, 
we select a number of 3D vertices Vi, but instead of using 
the mean mesh, we generate a face instance using the current 
estimated shape coefficients and then use these identity-specific 
vertex coordinates to project to 2D space. 

More specifically, we construct a matrix $\hat{\mathbf{V}} \in \mathbb{R}^{3n \times (m-1)}$ by
selecting the rows of the PCA basis matrix $\mathbf{V}$ corresponding
to the $n$ chosen vertex points. A face shape is generated with
the formula in equation (5) using $\hat{\mathbf{V}}$ and the current estimate
of $\boldsymbol{\alpha}$. The parameter vector $\boldsymbol{\theta}$ is then extended to incorporate
the shape coefficients: $\boldsymbol{\theta} = [r_x, r_y, r_z, t_x, t_y, t_z, \boldsymbol{\alpha}]^T$.

Fig. 1. 3D Morphable Model fitting using local features. (a) Input image. (b) The 3D Morphable Model is projected using the current set of shape and pose
parameters. The circles represent the new locations at which local features are extracted. (c) The local feature regions in the input image, where the features are
extracted from. These are used to update the model parameters, and then the process is repeated.

Given a new image with a face, and initial landmark
locations, initial values for the pose parameter part of $\boldsymbol{\theta}$ are
calculated. The shape is initialised at the mean face
($\alpha_i = 0 \; \forall i$). The model is projected using this initial estimate,
and local features are extracted at the projected 2D locations.
Using these features, the regressor predicts a parameter update
$\delta\boldsymbol{\theta}$, and the process is repeated using the new parameter
estimate.
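
The construction of the reduced basis and the handling of the extended parameter vector can be sketched as below. The helper names are hypothetical, and the sketch assumes that the full basis stores the x, y and z coordinates of vertex $k$ in rows $3k$, $3k+1$ and $3k+2$, and that only the leading shape coefficients are fitted.

```python
import numpy as np

def select_vertex_rows(basis, mean, vertex_ids):
    """Keep the x, y, z rows of the n chosen vertices: (3V, m-1) -> (3n, m-1)."""
    rows = (3 * np.asarray(vertex_ids)[:, None] + np.arange(3)).ravel()
    return basis[rows], mean[rows]

def landmark_vertices(sub_basis, sub_mean, alpha):
    """Identity-specific 3D positions of the chosen vertices, one row per vertex."""
    k = len(alpha)                       # only the first k coefficients are fitted
    return (sub_mean + sub_basis[:, :k] @ alpha).reshape(-1, 3)

def split_theta(theta):
    """theta = [r_x, r_y, r_z, t_x, t_y, t_z, alpha...] -> (pose, alpha)."""
    return theta[:6], theta[6:]

# Toy model (assumption): V = 10 vertices and m - 1 = 4 basis vectors.
basis = np.arange(120, dtype=float).reshape(30, 4)
mean = np.arange(30, dtype=float)
sub_basis, sub_mean = select_vertex_rows(basis, mean, [0, 3, 9])

theta = np.array([0.0, 0.1, 0.0, 0.0, 0.0, -1200.0, 0.0, 0.0])
pose, alpha = split_theta(theta)
vertices = landmark_vertices(sub_basis, sub_mean, alpha)   # mean face, since alpha = 0
```

In each iteration of the fitting, the vertices returned by `landmark_vertices` would be projected with the current pose before the local features are extracted.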

While in our case the points we selected on the mesh
coincide with distinctive facial feature points like the eye or
mouth corners, any vertex from the 3D model can be used.
For example, the points could be spaced equidistantly on the 3D
face mesh.

It is noteworthy that our method does not rely on particular 
detected 2D landmarks. A rough initialisation is sufficient to 
run the cascaded regression. In essence, the proposed method 
can also be seen as a 3D model based landmark detection, 
steering the 3D model parameters to converge to the ground 
truth location. This presents a novel step towards unifying 
landmark detection and 3D Morphable Model fitting. 

IV. Experimental Results 

In the following, we present results of the learning-based
cascaded regression method for pose fitting. Accurate ground
truth is obtained by generating synthetic data using the 3D
Morphable Model. To simulate more realistic conditions, for
each image, a random background is chosen from a
backgrounds dataset.

In a second experiment, we perform simultaneous shape
and pose fitting using the same method. We subsequently
compare the proposed method with POSIT and present
results on the PIE database.



A. Pose Fitting

In this first experiment, we evaluate our method by estimating
the 3D pose from the extracted local image features. To
train the cascaded regression, we generate poses from $-30^\circ$ to
$+30^\circ$ yaw and pitch variation, in $10^\circ$ intervals. Additionally,
Gaussian noise with $\sigma = 1.5$ mm in x and y translation is added
to simulate imprecise initialisation. The model is placed at an
initial z-location of $-1200$ mm and the focal length is fixed
to 1500. Each sample generated in this way is duplicated and
used with 5 different backgrounds. Parameter initialisations
are generated by perturbing every angle of each sample with
a value uniformly drawn from the interval $[-11^\circ, +11^\circ]$.

The cascaded regression is tested on the same angle
range, but we sample the test data at a finer resolution of
$5^\circ$ intervals to verify correct approximation of the learned
function. Figure 2 shows the mean absolute error of all three
predicted angles on the test data. The optimisation is initialised
in two different ways: first, with values uniformly drawn in the
interval $[-11^\circ, +11^\circ]$ around the ground truth, and second,
to evaluate the performance in the case of bad initialisation,
with the samples placed $11^\circ$ away from the ground truth in all
the images. In both cases, the algorithm converges, and each
regressor step promotes further convergence.

Fig. 2. Cascaded regression based 3D Morphable Model fitting. Evaluation of
pose fitting under different initialisations. (red solid): Initialisation uniformly
distributed around the ground truth. (blue dashed): Performance in case of
bad initialisation.

B. Simultaneous Shape- and Pose Fitting

The cascaded regression method allows us to simultaneously
estimate the shape parameters together with the pose
parameters in a unified way. The parameter vector $\boldsymbol{\theta}$ is extended
to include the first two PCA shape coefficients of
the Morphable Model: $\boldsymbol{\theta} = [r_x, r_y, r_z, t_x, t_y, t_z, \alpha_0, \alpha_1]^T$. We
generate test data following the same protocol as described in
Section IV-A with the addition that the identity of each test
sample is learned as well.
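
The sampling protocol of Section IV-A that is reused here can be sketched as follows. The text leaves several details open, so the sketch makes assumptions: pitch and yaw are sampled on a full grid, roll is kept at zero, the "bad" initialisation shifts every angle by $+11^\circ$, and all function names are hypothetical.

```python
import numpy as np

def pose_grid(step_deg, limit_deg=30):
    """(pitch, yaw, roll) samples in degrees; roll fixed at 0 (assumption)."""
    angles = np.arange(-limit_deg, limit_deg + 1, step_deg, dtype=float)
    pitch, yaw = np.meshgrid(angles, angles, indexing="ij")
    roll = np.zeros(pitch.size)
    return np.column_stack([pitch.ravel(), yaw.ravel(), roll])

def uniform_initialisation(angles_deg, rng, max_offset_deg=11.0):
    """Perturb every angle by a value drawn uniformly from [-11, +11] degrees."""
    offsets = rng.uniform(-max_offset_deg, max_offset_deg, size=angles_deg.shape)
    return angles_deg + offsets

def bad_initialisation(angles_deg, offset_deg=11.0):
    """Place every angle a fixed 11 degrees away from the ground truth."""
    return angles_deg + offset_deg

def translation_noise(n_samples, rng, sigma_mm=1.5):
    """Gaussian noise on the x and y translation, in millimetres."""
    return rng.normal(0.0, sigma_mm, size=(n_samples, 2))

rng = np.random.default_rng(3)
train_poses = pose_grid(step_deg=10)     # training grid, 10 degree intervals
test_poses = pose_grid(step_deg=5)       # finer test grid, 5 degree intervals
train_init = uniform_initialisation(train_poses, rng)
train_shift = translation_noise(len(train_poses), rng)
```

Under these assumptions the training grid has 49 poses and the test grid 169, before duplication over the 5 backgrounds.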













Fig. 3. Simultaneous shape- and pose fitting of a 3DMM using cascaded
regression. Evaluation on Morphable Model generated data. (blue dashed):
Shape cosine angle between ground truth and prediction (between 0 and 1, 1
is best). (red solid): Mean absolute error of the pose prediction.

Figure 3 shows the results of the joint shape- and pose
fitting. The shape fitting accuracy is measured by the cosine
angle between the coefficients of the ground truth $\boldsymbol{\alpha}_g$ and
the estimated face $\boldsymbol{\alpha}_e$: $d = \frac{\langle \boldsymbol{\alpha}_g, \boldsymbol{\alpha}_e \rangle}{\|\boldsymbol{\alpha}_g\| \cdot \|\boldsymbol{\alpha}_e\|}$. A shape similarity of
0.87 is achieved, and the pose estimation slightly improves
compared to the previous experiment, because the pose can
be more accurately estimated when the shape is allowed to
change. Similar to the first experiment, this experimental study
shows promising results for the local feature based fitting
method.
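
The two evaluation measures used in this section are straightforward to compute. The sketch below uses hypothetical function names, and returning 0 when one coefficient vector is all zeros (for which the cosine is undefined) is a convention chosen for this example only.

```python
import numpy as np

def shape_similarity(alpha_true, alpha_est):
    """Cosine of the angle between two shape coefficient vectors."""
    alpha_true, alpha_est = np.asarray(alpha_true, float), np.asarray(alpha_est, float)
    denominator = np.linalg.norm(alpha_true) * np.linalg.norm(alpha_est)
    if denominator == 0:
        return 0.0                      # undefined for a zero vector; convention here
    return float(alpha_true @ alpha_est / denominator)

def mean_absolute_angle_error(angles_true_deg, angles_est_deg):
    """Mean absolute error over all angles, in degrees."""
    difference = np.asarray(angles_true_deg, float) - np.asarray(angles_est_deg, float)
    return float(np.mean(np.abs(difference)))

same_direction = shape_similarity([1.0, 1.0], [2.0, 2.0])
orthogonal = shape_similarity([1.0, 0.0], [0.0, 1.0])
pose_error = mean_absolute_angle_error([0.0, 10.0, -5.0], [1.0, 8.0, -5.0])
```

Because the cosine ignores vector length, two coefficient vectors that point in the same direction but differ in magnitude still receive a similarity of 1.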

C. Comparison against POSIT 

To relate the performance of our method to existing solutions
in the literature, we compare the pose-fitting part of
our algorithm against POSIT (Pose from Orthography and
Scaling with ITerations). POSIT estimates the pose from a
set of 2D to 3D correspondences. In our case, the 2D points
are the ground truth landmarks. Tested on the same data as
in Section IV-A, POSIT achieves a mean absolute error over
all angles of 1.84°, compared to around 2° for our method.
It should be noted that POSIT achieves these results with
very accurate ground truth landmarks, which are often not
available in practical applications. If Gaussian noise of 5 pixels
is added to the landmarks, the error of POSIT rises to
3.68°. In contrast, the proposed algorithm does not
rely on detected landmarks and their accuracy.

D. Evaluation on PIE 

To evaluate the proposed method on real data, we use the
fitting results on the PIE database [20] that are provided with
the Basel Face Model (BFM) as ground truth data. In
this way, we can evaluate how well the proposed method
can approximate a complex state-of-the-art fitting method.
In particular, to evaluate how well the proposed method is able
to estimate the shape and pose under changing illumination
conditions, we split the available PIE data into illuminations
1 to 17 for training and 18 to 22 for testing, resulting in
3468 training and 1020 test images. Again, a 3-stage cascaded
regressor is learned.



Fig. 4. Simultaneous shape- and pose fitting using cascaded regression.
Evaluation on the PIE database. (blue dashed, green dash-dotted): Shape
cosine angle between ground truth and prediction. (red solid, magenta dotted):
Mean absolute error of the pose prediction over all angles.

Figure 4 shows the mean absolute angle error on the
training and test set, as well as the cosine angle of the
shape coefficients. It can be seen that the test error is only
marginally higher than the training error, and the proposed
method generalises well to unseen illumination conditions. On
the test data, the angle is estimated with an average error of
0.86°, and a shape similarity of 0.84 is achieved.

E. Runtime 

Cascaded regression methods are inherently fast, and so
is the proposed local feature based Morphable Model fitting.
Estimating the pose and shape requires around 200 milliseconds
per image, with an unoptimised implementation. The
runtime largely depends on the number of vertices that are
used for local feature extraction, as they directly influence the
dimensionality of the feature vector (and thus the prediction
time of the regressor). The algorithm could be sped up further
by using faster feature extraction or by compressing the feature
vectors using PCA, but it has, already in its current form, the
potential to be used in real-time applications and at large
scale.
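
The suggested compression of the feature vectors with PCA could look like the sketch below. It is not part of the evaluated method: the function names are hypothetical, and the toy feature matrix is constructed to have low rank so that the effect of the projection is easy to verify.

```python
import numpy as np

def fit_feature_pca(features, n_components):
    """Learn a PCA projection from an (M, D) matrix of training feature vectors."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]           # components: (n_components, D)

def compress(features, mean, components):
    """Project feature vectors onto the leading principal components."""
    return (features - mean) @ components.T

def decompress(codes, mean, components):
    """Map compressed codes back to the original feature space."""
    return codes @ components + mean

# Toy features (assumption): 200 vectors of dimension 640 lying in a 10-dimensional subspace.
rng = np.random.default_rng(4)
features = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 640)) + 3.0
mean, components = fit_feature_pca(features, n_components=10)
codes = compress(features, mean, components)
restored = decompress(codes, mean, components)
```

The regressors would then be trained on `codes` instead of the full feature vectors, which shrinks each projection matrix from 640 to 10 columns in this toy setting.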


V. Conclusion 

We proposed a new way to fit 3D Morphable Models
that uses local image features. We overcome the obstacle
of optimising a cost function that contains a non-differentiable
feature extraction operator by using a learning-based cascaded
regression method that learns the gradient direction from data.
We evaluated the proposed method on accurate Morphable
Model generated data as well as on real images. The method
yields promising results, as local features have been proven
robust in many vision algorithms. In contrast to many existing
fitting algorithms, our algorithm achieves near real-time
performance. Future work includes further evaluation of
how the proposed method can unify landmark detection and
3D Morphable Model fitting, and fitting of more shape and
possibly albedo coefficients.